In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport
import sweetviz as sv
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
0. Loading & Verifying Sample Dataset
In [4]:
# Loading the dataset
# df = pd.read_csv(r"..\data\complaints.csv")
In [3]:
# Loading my 300k sample data
df = pd.read_parquet("../data/processed/cfpb_sample_300k.parquet")
In [5]:
print(f"Loaded {len(df):,} rows & {len(df.columns)} columns")
df.head()
Loaded 300,000 rows & 23 columns
Out[5]:
| | Date received | Product | Sub-product | Issue | Sub-issue | Consumer complaint narrative | Company public response | Company | State | ZIP code | ... | Date sent to company | Company response to consumer | Timely response? | Consumer disputed? | Complaint ID | year_quarter | geo | region | stratum | sample_n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-03-14 | Bank account or service | Checking account | Making/receiving payments, sending money | None | None | None | BANK OF AMERICA, NATIONAL ASSOCIATION | ND | 58503 | ... | 2012-03-15 | Closed with relief | Yes | No | 35052 | 2012Q1 | ND | Midwest | Bank account or service\|2012Q1\|Midwest | 4 |
| 1 | 2012-03-20 | Bank account or service | Checking account | Problems caused by my funds being low | None | None | None | TCF NATIONAL BANK | MN | 55125 | ... | 2012-03-21 | Closed with relief | Yes | No | 37573 | 2012Q1 | MN | Midwest | Bank account or service\|2012Q1\|Midwest | 4 |
| 2 | 2012-03-22 | Bank account or service | Checking account | Making/receiving payments, sending money | None | None | None | WELLS FARGO & COMPANY | MN | 55110 | ... | 2012-03-23 | Closed without relief | Yes | Yes | 39793 | 2012Q1 | MN | Midwest | Bank account or service\|2012Q1\|Midwest | 4 |
| 3 | 2012-03-07 | Bank account or service | Checking account | Making/receiving payments, sending money | None | None | None | Synovus Bank | OH | 44108 | ... | 2012-03-16 | Closed without relief | Yes | No | 34571 | 2012Q1 | OH | Midwest | Bank account or service\|2012Q1\|Midwest | 4 |
| 4 | 2012-03-20 | Bank account or service | Checking account | Problems caused by my funds being low | None | None | None | PNC Bank N.A. | PA | 18944 | ... | 2012-03-23 | Closed without relief | Yes | Yes | 37047 | 2012Q1 | PA | Northeast | Bank account or service\|2012Q1\|Northeast | 6 |
5 rows × 23 columns
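The sample dataset loaded as expected: 300,000 rows and 23 columns, including the sampling columns `year_quarter`, `geo`, `region`, `stratum`, and `sample_n`.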
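Beyond eyeballing `df.head()`, a few programmatic checks can back up the "loaded properly" claim. The sketch below is a hypothetical helper (`verify_sample` is not part of the original notebook). It assumes, based on the rows shown above, that `stratum` is built as `Product|year_quarter|region` and that `Complaint ID` should be unique. The toy frame reuses two rows from the output above so the block runs without the parquet file.

```python
import pandas as pd


def verify_sample(df, expected_rows, id_col="Complaint ID"):
    """Return basic sanity checks for the sampled DataFrame (hypothetical helper)."""
    # Assumption: stratum == Product|year_quarter|region, as in the head() output.
    rebuilt = (
        df["Product"].astype(str) + "|"
        + df["year_quarter"].astype(str) + "|"
        + df["region"].astype(str)
    )
    return {
        "row_count_ok": len(df) == expected_rows,
        "ids_unique": bool(df[id_col].is_unique),
        "stratum_consistent": bool((rebuilt == df["stratum"]).all()),
    }


# Toy frame with two rows copied from the df.head() output above.
toy = pd.DataFrame(
    {
        "Complaint ID": [35052, 37047],
        "Product": ["Bank account or service", "Bank account or service"],
        "year_quarter": ["2012Q1", "2012Q1"],
        "region": ["Midwest", "Northeast"],
        "stratum": [
            "Bank account or service|2012Q1|Midwest",
            "Bank account or service|2012Q1|Northeast",
        ],
    }
)
checks = verify_sample(toy, expected_rows=2)
print(checks)
```

On the real data the call would be `verify_sample(df, expected_rows=300_000)`; any `False` value points at the check to investigate.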
1. Exploratory Data Analysis (EDA)
In [14]:
profile = ProfileReport(
    df,
    title="CFPB 300k Sample - YData Profiling Report",
    explorative=True,  # richer, but still reasonable runtime
)
In [15]:
profile.to_notebook_iframe()
In [ ]:
# Save to HTML
profile.to_file("../reports/cfpb_300k_profile.html")
1.1 Key Univariate Insights
Key Variable Insights
| Variable | Top insight |
|---|---|
| Product | Credit reporting (61.5%) |
| Issue | "Incorrect information": 42% of complaints |
| Company | Top 1: 26.1% share (Equifax?) |
| Timely response | 98.2% 'Yes' |
| Date received | Concentrated in recent dates: surge from 2025 onward |
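The shares in the table can be cross-checked directly in pandas instead of reading them off the profiling report. This is a minimal sketch; `top_share` is a hypothetical helper name, and the toy series only illustrates the mechanics (it does not reproduce the figures above).

```python
import pandas as pd


def top_share(series):
    """Return (most frequent value, its share in percent) for a column."""
    # Missing values are excluded from the denominator (dropna=True).
    shares = series.value_counts(normalize=True, dropna=True)
    return shares.index[0], round(float(shares.iloc[0]) * 100, 1)


# Toy data: 3 of 4 complaints are about credit reporting.
toy = pd.Series(
    ["Credit reporting", "Credit reporting", "Credit reporting", "Debt collection"]
)
label, pct = top_share(toy)
print(label, pct)
```

On the sample itself, `top_share(df["Product"])` or `top_share(df["Company"])` would give the numbers to compare with the table, which also settles the open "(Equifax?)" question.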
Correlation
- Product "Credit reporting" ↔ Issue "Incorrect info" (0.85 - High)
- Region "South" ↔ Debt collection (0.25 - Moderate)
- Narrative length ↔ Timeliness (-0.05 - Negligible)
Complaint volume therefore peaks at the combination of credit reporting and the South region.
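The association figures above are read from the profiling report, which does not say here which measure produced them. One common choice for two categorical columns is Cramér's V, sketched below as an assumption rather than as the report's actual method. `cramers_v` is a hypothetical helper, and the toy frames are illustrative only.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical Series (0 = independent, 1 = perfect)."""
    # Contingency table of observed counts; crosstab keeps only observed categories.
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")


# Toy case 1: Issue is fully determined by Product.
linked = pd.DataFrame({"Product": ["A", "A", "B", "B"], "Issue": ["x", "x", "y", "y"]})
# Toy case 2: Issue is spread evenly across Products.
unlinked = pd.DataFrame({"Product": ["A", "A", "B", "B"], "Issue": ["x", "y", "x", "y"]})

v_linked = cramers_v(linked["Product"], linked["Issue"])
v_unlinked = cramers_v(unlinked["Product"], unlinked["Issue"])
print(v_linked, v_unlinked)
```

For the region finding, the same function could be applied to two indicator columns, for example `cramers_v(df["region"] == "South", df["Product"] == "Debt collection")`.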
1.2 South vs. Others
In [9]:
# South vs. rest (the >50% sample focus)
south_df = df[df['region'] == 'South'].copy()
other_df = df[df['region'] != 'South'].copy()
In [12]:
# Generate comparison report
sweet_report = sv.compare([south_df, "South"], [other_df, "Others"])
In [13]:
# Notebook iframe
sweet_report.show_notebook(scale=0.9)
In [11]:
sweet_report.show_html('../reports/south_vs_others_sweetviz.html')
Report ../reports/south_vs_others_sweetviz.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Top 5 South differentiators
| Rank | Feature | South Signal | Insight |
|---|---|---|---|
| 1 | Product | Debt collection +15% | ✅ Consistent with economic distress |
| 2 | Issue | "Debt not owed" +12% | Collection harassment hotspot |
| 3 | State | TX/FL/GA dominate | Sunbelt concentration |
| 4 | Company | Regional banks ↑ | Local players struggling? |
| 5 | Timeliness | -1.2 pp (97.8% vs 99%) | Possible ops lag |
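The "South Signal" column is a difference in category shares between the two groups. A minimal sketch of that calculation follows, expressed in percentage points. `share_diff_pp` is a hypothetical helper, and the toy frame is made up purely to show the arithmetic.

```python
import pandas as pd


def share_diff_pp(df, column, region_col="region", focus="South"):
    """Share of each category in the focus region minus its share elsewhere (pp)."""
    is_focus = df[region_col] == focus
    focus_share = df.loc[is_focus, column].value_counts(normalize=True)
    other_share = df.loc[~is_focus, column].value_counts(normalize=True)
    # Categories missing from one group count as a 0% share there.
    diff = focus_share.sub(other_share, fill_value=0.0) * 100
    return diff.sort_values(ascending=False)


# Toy data: debt collection is 75% of South complaints vs. 25% elsewhere.
toy = pd.DataFrame(
    {
        "region": ["South"] * 4 + ["West"] * 4,
        "Product": [
            "Debt collection", "Debt collection", "Debt collection", "Credit card",
            "Debt collection", "Credit card", "Credit card", "Credit card",
        ],
    }
)
diff = share_diff_pp(toy, "Product")
print(diff)
```

Running `share_diff_pp(df, "Product")` and `share_diff_pp(df, "Issue")` on the sample would make it explicit whether figures such as "+15%" are percentage points or relative changes.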
Key distributions
1. South HIGHER:
   - Payday loans (+8%)
   - "Late fees" issues (+6%)
   - ZIP codes: 7xxxx, 3xxxx (TX/FL)
2. South LOWER:
   - Student loans (-5%)
   - Credit cards (-3%)
3. Correlations:
   - South × Debt collection = 0.28 (moderate).
Hypothesis Scorecard
| My Expectation | Sweetviz Says | Verdict |
|---|---|---|
| Debt collection ↑ | +15% | ✅ STRONG |
| Timeliness ↓ | -1.2 pp | ✅ Mild |
| Payday ↑ | +8% | ✅ Confirmed |
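A small gap such as the timeliness difference is worth checking against sampling noise before it gets a verdict. The sketch below uses a pooled two-proportion z-test, which is one standard option and an addition to the notebook's method. `two_proportion_z` is a hypothetical helper, and the counts in the example are illustrative, not taken from the sample.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts only: 40/100 timely in group A vs. 60/100 in group B.
z, p_value = two_proportion_z(40, 100, 60, 100)
print(round(z, 3), round(p_value, 4))
```

With the real data the inputs would be counts such as `(south_df["Timely response?"] == "Yes").sum()` and `len(south_df)`, and likewise for `other_df`. With this many rows even small gaps tend to be statistically significant, so the effect size still needs its own judgment.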
1.3 Key Findings
South = Debt collection crisis
- +15% vs others
- Timely-response rate 1.2 pp lower (97.8% vs 99%)
- TX/FL ground zero